Scalable programming and algorithms for data-intensive life science applications.

Author

  • Judy Qiu
Abstract

Cloud computing [1] offers new approaches for scientific computing that leverage the major commercial hardware and software investment in this area. How well closely coupled applications run in clouds is still unclear, as synchronization costs remain higher than on optimized MPI machines. However, loosely coupled problems are very important in many fields and can achieve good cloud performance even when pleasingly parallel steps are followed by reduction operations, as supported by MapReduce. It appears that many data analysis problems fit the MapReduce paradigm, but there is no definitive analysis here. For example, analysis of LHC (Large Hadron Collider) data corresponds to a data selection step followed by forming histograms; this corresponds “perfectly” to the MapReduce paradigm. In Life Science, “all-pairs” applications like BLAST can run well with MapReduce but are particularly simple, corresponding to a “pleasingly parallel” or “map only” structure. Finally, there are applications involving steps like the dimension reduction or clustering algorithms illustrated below, where pleasingly parallel operations (such as alignment and sequence distance computation) are followed by data mining steps involving iterative operations, such as those present in matrix algebra. Such iterative algorithms are the mainstay of large-scale scientific computing and are linked directly to data through data assimilation in the weather and climate area [2]. Even in the “birthplace” of MapReduce, Information Retrieval, the PageRank algorithm needs iterative MapReduce. Thus we pose the following questions.
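The three application structures named in the abstract (map only, map followed by reduction, and iterative MapReduce) can be illustrated with a minimal single-process sketch. This is not the author's implementation and uses no real MapReduce framework: the helper `map_reduce`, the selection threshold, the bin width, and the toy link graph are all assumptions made for illustration, and the PageRank variant assumes every page has at least one distinct outgoing link.

```python
from collections import Counter
from functools import reduce


def map_only(records, mapper):
    """'Map only' / pleasingly parallel: each record is processed independently."""
    return [mapper(r) for r in records]


def map_reduce(records, mapper, reducer, initial):
    """Map each record independently, then fold the mapped values together."""
    mapped = (mapper(r) for r in records)    # pleasingly parallel step
    return reduce(reducer, mapped, initial)  # reduction step


def select_and_bin(event, threshold=10.0, bin_width=5.0):
    """Selection followed by binning: one count for a kept event, none otherwise.

    threshold and bin_width are hypothetical values, not taken from the paper.
    """
    if event < threshold:
        return Counter()
    return Counter({int(event // bin_width): 1})


def histogram(events):
    """Map + reduce: select events, then sum per-bin counts into a histogram."""
    return map_reduce(events, select_and_bin, lambda a, b: a + b, Counter())


def pagerank(links, damping=0.85, iterations=50):
    """Iterative MapReduce: one map + reduce pass per iteration.

    links maps each page to a list of distinct pages it links to; every page
    is assumed to have at least one outgoing link (no dangling pages).
    """
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        def contributions(page):
            # Map: a page sends an equal share of its rank to each target.
            share = ranks[page] / len(links[page])
            return Counter({target: share for target in links[page]})

        # Reduce: sum the shares received by each target page.
        totals = map_reduce(links, contributions, lambda a, b: a + b, Counter())
        ranks = {p: (1 - damping) / n + damping * totals[p] for p in links}
    return ranks
```

The histogram needs a single map and reduce pass, whereas `pagerank` repeats the same pass many times over data that does not change between iterations (the link graph). That repeated pass over fixed data is the pattern the abstract refers to as iterative MapReduce.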

Journal:
  • Omics : a journal of integrative biology

Volume 15, Issue 4

Pages: -

Publication date: 2011